1 Install R package, RStudio, directory set-up

1.1 Install R and RStudio

Install R and install RStudio on your own laptop or owned computer https://posit.co/download/rstudio-desktop/ based on your operating system, available for Linux, MacOS, and Windows

Install R and install RStudio on the University computers: go to Start manu –> Software Center –> search for RStudio, and R for Windows. The current version of R in the computers here is 4.3.2

1.2 Differential expression analysis (DEA) work directory setting up

Open up RStudio and setting up a new project for DEA analysis

1.Go to the File menu and select New Project.

2.In the New Project window, choose New Directory. Then, choose New Project. Name your new directory RNA_seq_DEA and then Create the project as subdirectory of:the Desktop or OneDrive - University of Helsinki (or location of your choice).

3.The new project should automatically open in RStudio.

1

To check whether or not you are in the correct working directory, use getwd(). The path Desktop/RNA_seq_DEA should be returned to you in the console.

getwd()
## [1] "C:/Users/Zilan/OneDrive - University of Helsinki/ZilanWen Github/RNA-seq_Differential-Gene-Expression-Analysis"

To get the location of working directory For example: At Desktop, build a folder At RStudio –> Session –> Set Working Directory –> Choose Directory –> Desktop –> RNA-seq_DEA

or use code below:

setwd("C:/Users/username/Desktop/RNA-seq_DEA")

Check the current R version. The current R version is R-4.5.0

R.Version()

Within your working directory use the New folder button in the bottom right panel to create two new directories data and results.

Now we need to download the files from Moodle and the data list which looks like below
data

Then we go to the File menu and select New File, then select R markdown. This should open up a script editor in the top left hand corner. This is where we will be typing and saving all commands required for this analysis. Now save the file as DEA_script.R. When finished your working directory should now look similar to this:
directories

data for the original data, results for generated files and figures.

2 GitHub

GitHub is an online software development platform. It’s used for storing, tracking, and collaborating on software projects. It makes it easy for developers to share code files and collaborate with fellow developers on open-source projects.

Set up your GitHub: Go here https://github.com/, log in or sign up with your email.

Create your repository: Repository is the project you can create which is on the top left of your GitHub home page.To create a new repository, you can click the green button New, then name the new repository. Here please go to 2.1 to follow the instruction to type a name for your repository username.github.io, so that you can create a github page later.

Clone a repository from other GitHUb accounts Forking a repository can make a copy of a repository from other account to your account.
fork

2.1 How to create a Github pages site?

The instruction of creating a GitHub pages site

  1. If you want to create a site in a new repository.
  • In the upper-right corner of any page, select ´+´, then click New repository.
    createNewrepp
  1. Use the Owner dropdown menu to select the account you want to own the repository.

  2. Type a name for your repository and an optional description. If you’re creating a user or organization site, your repository must be named <user>.github.io or <organization>.github.io. If your user or organization name contains uppercase letters, you must lowercase the letters.

  • See the picture below:
    user
  1. click Public –> choose Add a README file –> click Create repository

  2. Under your repository name, click Settings–> click Pages in the “Code and automation” section –> Click Visit site to see your publish site

  • But you can’t find the site now, clear brower cache and reload GitHub. It can take up to 10 minutes for changes to your site to publish after you push the changes to GitHub.
    nopageyet
  1. Rename the repository name as RNA_seq_DEA in General in Setting

  2. Use the Github desktop version: We are going to download and upload by commiting changes from our local files to the cloud. Go to GitHub Desktophttps://desktop.github.com/, download based on your operating system. Then go to sign in with your GitHub account.Go back to your GitHUb repository, and click open with Github Desktop
    click open with Github Desktop

  3. Clone a repository and connect repository to local files In local path, choose where we want to save the repository in our local computer.click Clone. Now we have cloned the repository that we created on GitHub and we have connected it to the GitHub desktop application.
    connect

  4. Upload data, results files in the local path of your repository

  5. Push local files to GitHub cloud Any changes and updated files can be uploaded to GitHUb cloud though GitHub Desktop.After change files, go back to GitHub Desktop and navigate the corresponding repository, you will have the changes listed on the top left. We now give the change a title such as ‘11’ in thesummary (required)box and click Commit to main on the bottom left, Then on the right, we can push comments to the origin remote by clicking Push origin.
    push

  6. Github page link could be visible now. Go to GitHub website –> the setting –> Pages. You can now find the link of the Github page.
    pagevisible

If you want to create a site in an existing repository, you can rename the respositoty as username.github.io. Rename to what you want after you get the link.

  1. The page generated by knit to HTML is not visible If we click to visit site, it would be like this:
    pagecontent

  2. Upload the .rmd file and the .html file into GitHub local path, rename the files as index.rmd, index.html, push to github cloud. In order to make the knit function well, please change the path-related code into the right local github path in your .rmd file. After a while, you will get the right contents. When click the visiting site, now it shows like the picture below:
    content

3 Assignment

You maybe have a few of questions about the RNA-Seq tutorial assignment,

  • Is R markdown documents the preferred format?

  • Is there also an option to turn in the R scripts and associated plots?

  • What is the expected contents of results to be included in the submission?

  • Are we supposed to include both the code and the results that come from running it along with all of the different plots?

  • What is(/are) the thing(s) to be demonstrated when grading this assignment?

  • What is the deadline for the submission of this assignment?

Q: Is R markdown documents the preferred format?

A: The preferred formats is to push all the files into github cloud as public repository and create a github page site which links to your knit to html file from R markdown. You can then just send your GitHub page link to the email zilan.wen@helsinki.fi. All the files in your repository include results, data, R markdown file named index.rmd, the html file named index.html.
github files

Q: Is there also an option to turn in the R scripts and associated plots?

A: If you don’t feel like to github, you can also have the option that writing down data analysis report (.doc or .pdf format) which has result description with tables, figures in the main text and discussion, with supplementary other figures, tables, code files. Zip them and sent email to zilan.wen@helsinki.fi.

Q: What is the expected contents of results to be included in the submission?

A: The results include:

  • Sample level quality control
    • figure: plotMDS.Infection.png, PlotPCA_dds.pdf, PlotHeatmap_dds.pdf
  • Count distribution
    • figure: count distribution.png
  • The amount of upregulated genes and downregulated genes
    • table to show the number of upregulated genes and downregulated genes
    • figure plotSmear.InfectionRNAseq.png

Supplementary files:

  1. the table of raw counts together named ‘infection.rawCounts.csv’.

  2. the normalized counts without filter named ‘infection.normCPM.csv’

  3. Normalized counts after filtered by DESeq2, file named such as ‘normalized_counts_DESeq2.csv’

  4. Normalized counts by TMM in EdgeR, files named ‘Infection.filtered.normCPM.csv’.

  5. The up-regulated genes saved in the file named ‘lrt.Ha_vs_Ctr_UP.csv’

  6. The down-regulated genes saved in the file named ‘lrt.Ha_vs_Ctr_DW.csv’

The results list like below:
results list

Q:Are we supposed to include both the code and the results that come from running it along with all of the different plots?

A: Yes.

Q: What is(/are) the thing(s) to be demonstrated when grading this assignment?

The things to be demonstrated: Learning about the count normalization, sample-level quality control by PCA, selecting differentially expressed genes(DEGs), compare the numbers of DEGs among the different treatment groups and extract how many DEGs overlap in different conditions.

Q: What is the deadline for the submission of this assignment?

A: The deadline is 10.06.2025.

4 Count normalization

Created on 15.05.2025
@author: Zilan Wen
Email:

4.1 Normalization Overview

The first step in the DE analysis workflow is count normalization, which is necessary to make accurate comparisons of gene expression between samples.

The main factors often considered during normalization are:

  • sequencing depth or coverage.
4
4
  • Gene length

  • RNA composition

These three factors have been described very well in [DEG workshop Salmon] (https://github.com/hbctraining/DGE_workshop_salmon/blob/master/lessons/02_DGE_count_normalization.md)

4.1.1 Common normalization methods

  • CPM counts per million counts scaled by total number of reads.

  • TPMtranscripts per kilobase million counts per length of transcript (kb) per million reads mapped.

  • RPKM or RKPM reads/fragments per kilobase of exon per million reads / fragments mapped similar to TPM. DESeq2's median of ratios counts divided by sample-specific size factors determined by median ratio of gene counts relative to geometric mean per gene.

  • EdgeR's trimmed mean of M values (TMM) uses a weighted trimmed mean of the log expression ratios between samples.

The differences between RPKM, FPKM and TPM are explained clearly by SatQuest video

4.1.2 DESeq2-normalized counts: Median of ratios method

This method is explained clearly in DEG_workshop_salmon Since tools for differential expression analysis are comparing the counts between sample groups for the same gene, gene length does not need to be accounted for by the tool. However, sequencing depth and RNA composition need to be taken into account.

4.2 Count normalization of our dataset with DESeq2

For transcriptome study, RNA samples which are isolated from control samples and treatment samples are sent to sequencing company. The following steps are :

  • RNA-seq with a sequencing depth of 10-30 M reads per library (at least 3 biological replicates per sample)

  • Aligning or mapping the quality-filtered sequenced reads to respective genome (e.g. HISAT2 or STAR or Bowtie2).

  • Quantifying reads that are mapped to genes or transcripts (e.g.HTseq)

  • Gene differential expression analysis

  • Functional analysis such as enrichment of known biological functions, interactions, or pathways

RNA samples Heterobasidion annosum is a forest pathogen which can cause root and stem rot disease on Scots pine and Norway spruce. In this experiment, two-month old Scots pine plants were inoculated with H. annosum. After one month infection, total RNA was extracted from Scots pine roots. Control plants were fungus-free. We have three biological replicates for each treatment. The treatment samples are Ha1,Ha2, Ha3 and the control samples are Ctr1, Ctr2, Ctr3. Since genome of Scots pine is not available, we align reads to Pinus taeda genome which can be found in plantgenie https://plantgenie.org/.

In this tutorial we start with raw reads count ´.txt´ files, normalize the counts using DESeq2.This requires a few steps:

    1. Install packages DESeq2, EdgeR
    1. Build a file description fileDesc.txt file for all .txt files about counts matrix,including file path, inoculation, and description.
    1. Read data files.
    1. Build meta data. meta file is similar to filedesc, bnut we use meta for create DESeqDataset object. Ensure the row names of the metadata dataframe are present and in the same order as the column names of the counts data frame.
    1. Create DESeqDataset object
    1. Generate size factors
    1. Generate the normalized counts

4.2.1 Install packages

Install DESeq2and packageEdgeR

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2",force = TRUE)
BiocManager::install("edgeR",force = TRUE)

4.2.2 Build a file description

Here the file description called fileDesc is already build and save in data folder. We will use it for reading a list of seperated files.

4.2.3 Read data files

Set to the current work directory

setwd('C:/Users/Zilan/OneDrive - University of Helsinki/ZilanWen Github/RNA-seq_Differential-Gene-Expression-Analysis')

Now we start to read raw data

Coinfection.targets<-read.delim("./data/fileDesc.txt")

change the rownames of the dataframe Coinfection.targets, later on we don’t need to rename column names.

rownames(Coinfection.targets)<-c("Ha1","Ha2","Ha3","Ctr1","Ctr2","Ctr3")

load the packege edgeRand DESeq2

library(edgeR)
library(DESeq2)

read the six .txt files listed in Coinfection.targets.

Coinfection.orig <- readDGE(Coinfection.targets, header=F)

check the dimension of the data set

dim(Coinfection.orig)
## [1] 51751     6

chreck the first 6 rows of the data

head(Coinfection.orig)
## An object of class "DGEList"
## $samples
##                files inoculation         description group lib.size
## Ha1   ./data/Ha1.txt          Ha H.annosum infection     1 12510293
## Ha2   ./data/Ha2.txt          Ha H.annosum infection     1  8033676
## Ha3   ./data/Ha3.txt          Ha H.annosum infection     1 10094630
## Ctr1 ./data/Ctr1.txt         Ctr             Control     1 10763763
## Ctr2 ./data/Ctr2.txt         Ctr             Control     1 10539029
## Ctr3 ./data/Ctr3.txt         Ctr             Control     1 12251172
##      norm.factors
## Ha1             1
## Ha2             1
## Ha3             1
## Ctr1            1
## Ctr2            1
## Ctr3            1
## 
## $counts
##             Samples
## Tags         Ha1 Ha2 Ha3 Ctr1 Ctr2 Ctr3
##   PITA_19277 274 169 195   90   80  329
##   PITA_36893   0   0   0    0    0    0
##   PITA_43071 157 101 150  128  133  127
##   PITA_31484   0   0   0    0    0    0
##   PITA_22913   0   0   0    0    0    0
##   PITA_43120  62  42  22   53   24   15

Extact counts dataframe

Coinfection.rawCount <- Coinfection.orig$count
dim(Coinfection.rawCount)
## [1] 51751     6
head(Coinfection.rawCount)
##             Samples
## Tags         Ha1 Ha2 Ha3 Ctr1 Ctr2 Ctr3
##   PITA_19277 274 169 195   90   80  329
##   PITA_36893   0   0   0    0    0    0
##   PITA_43071 157 101 150  128  133  127
##   PITA_31484   0   0   0    0    0    0
##   PITA_22913   0   0   0    0    0    0
##   PITA_43120  62  42  22   53   24   15

4.2.4 Build meta data

We define sampletype: we only have two types of samples, Control and Ha-infection

sampletype <- factor(c(rep("Ha",3), rep("Ctr", 3)))

Build meta data frame

meta <- data.frame(sampletype, row.names = colnames(Coinfection.orig$count))

check the column name of counts dataframe

colnames(Coinfection.orig$count)
## [1] "Ha1"  "Ha2"  "Ha3"  "Ctr1" "Ctr2" "Ctr3"

check the rowname of meta dataframe

rownames(meta)
## [1] "Ha1"  "Ha2"  "Ha3"  "Ctr1" "Ctr2" "Ctr3"

Check that sample names match in both files

all(colnames(Coinfection.orig$count) %in% rownames(meta))
## [1] TRUE

4.2.5 Create DESeqDataset object

load the package DESeq2 Create DESEq2 dataset dds

library(DESeq2)
dds <- DESeqDataSetFromMatrix(Coinfection.orig, colData = meta, design = ~ sampletype)

If you want to view dds, you can use the code:

head(counts(dds))
##             Samples
## Tags         Ha1 Ha2 Ha3 Ctr1 Ctr2 Ctr3
##   PITA_19277 274 169 195   90   80  329
##   PITA_36893   0   0   0    0    0    0
##   PITA_43071 157 101 150  128  133  127
##   PITA_31484   0   0   0    0    0    0
##   PITA_22913   0   0   0    0    0    0
##   PITA_43120  62  42  22   53   24   15

4.2.6 Generate size factors

To perform the median of ratios method of normalization, DESeq2 has a single estimateSizeFactors() function that will generate size factors.

dds <- estimateSizeFactors(dds)

take a look at the normalization factor applied to each sample

sizeFactors(dds)
##       Ha1       Ha2       Ha3      Ctr1      Ctr2      Ctr3 
## 1.1734464 0.7897579 0.9805609 1.0611942 0.9794690 1.1512821

4.2.7 Generate the normalized counts

Now, to retrieve the normalized counts matrix from dds, we use the counts() function and add the argument normalized=TRUE.

normalized_counts <- counts(dds, normalized=TRUE)

Save the normalized_countsinto results to your local path.

write.csv(normalized_counts, file="./results/coinfection_normalized_counts_DESeq2.csv")

5 Sample-level quality Control

5.1 Principal Component Analysis (PCA)

5.1.1 PCA overview

A useful initial step in an RNA-seq analysis is often to assess overall similarity between samples:

  • Which samples are similar to each other, which are different?

  • Does this fit to the expectation from the experiment’s design?

  • What are the major sources of variation in the dataset?

To explore the similarity of our samples, we will be performing sample-level QC using Principal Component Analysis (PCA) and hierarchical clustering methods. These methods/tools allow us to check how well similar the replicates are to each other (clustering) and to make sure that the experimental condition is the major source of variation in the data. Sample-level QC can also help identify any samples behaving like outliers; we can further explore any potential outliers to determine whether they need to be removed prior to DE analysis.

These unsupervised clustering methods are run using log2 transformed normalized counts. The log2 transformation improves the distances/clustering for visualization. Instead of using an ordinary log2 transform, we will be using regularized log transform (rlog), to avoid any bias from the abundance of low-count genes; Note1 below explains this in more detail.

For more details, please go to DEG workshop Salmon

5.1.2 Using our dataset for PCA

  • Transform normalized counts for the dataset

To improve the distances/clustering for the PCA and heirarchical clustering visualization methods, we need to moderate the variance across the mean by applying the rlog transformation to the normalized counts.

Transform counts for data visualization

rld <- rlog(dds, blind=TRUE)

The blind=TRUE argument is to make sure that the rlog() function does not take our sample groups into account - i.e. does the transformation in an unbiased manner. When performing quality assessment, it is important to include this option. The DESeq2 vignette has more details about this.

The rlog() function returns a DESeqTransform object, another type of DESeq-specific object. The reason you don’t just get a matrix of transformed values is because all of the parameters (i.e. size factors) that went into computing the rlog transform are stored in that object. We use this object to plot the PCA and heirarchical clustering figures for quality assessment.

  • start with PCA

The function plotPCA() requires two arguments as input: a DESeqTransform object and the “intgroup” (interesting group), i.e. the name of the column in our metadata that has information about the experimental sample groups.

plotPCA(rld, intgroup="sampletype")
## using ntop=500 top features by variance

If we want to save PDF file on own computer:

pdf("./results/PlotPCA_dds.pdf")
plotPCA(rld, intgroup="sampletype")
dev.off()
  • Exercise:

    1. What does the above plot tell you about the similarity of samples?

    2. Does it fit the expectation from the experimental design?

    3. What do you think the %variance information (in the axes titles) tell you about the data in the context of the PCA?

5.2 Hierarchical Clustering Heatmap

Similar to PCA, hierarchical clustering is another, complementary, method for identifying strong patterns in a dataset and potential outliers. The heatmap displays the correlation of gene expression for all pairwise combinations of samples in the dataset. Since the majority of genes are not differentially expressed, samples generally have high correlations with each other (values higher than 0.80). Samples below 0.80 may indicate an outlier in your data and/or sample contamination.

There is no built-in function in DESeq2 for plotting the heatmap for diplaying the pairwise correlation between all the samples and the hierarchical clustering information; we will use the pheatmap() function from the pheatmap package. This function cannot use the DESeqTransform object as input, but requires a matrix or dataframe. So, the first thing to do is retrieve that information from the rld object using a function called assay() (from the SummarizedExperiment package) that converts the data in a DESeqTransform object to a simple 2-dimensional data structure (a matrix in this case).

Extract the rlog matrix from the object using assay function which is part of the SummarizedExperiment package which is a DESeq2 dependency and is loaded with the DESeq2 library

rld_mat <- assay(rld)

Next, we need to compute the pairwise correlation values for all the samples. We can do this using the cor() function:

rld_cor <- cor(rld_mat) 

check the output of cor(), make note of the row names and column names

head(rld_cor)
##            Ha1       Ha2       Ha3      Ctr1      Ctr2      Ctr3
## Ha1  1.0000000 0.9974935 0.9978500 0.9956639 0.9956698 0.9937710
## Ha2  0.9974935 1.0000000 0.9978706 0.9961065 0.9957914 0.9953450
## Ha3  0.9978500 0.9978706 1.0000000 0.9956039 0.9954369 0.9943809
## Ctr1 0.9956639 0.9961065 0.9956039 1.0000000 0.9967852 0.9946562
## Ctr2 0.9956698 0.9957914 0.9954369 0.9967852 1.0000000 0.9943313
## Ctr3 0.9937710 0.9953450 0.9943809 0.9946562 0.9943313 1.0000000

We will use meta for annotation argument later, let’s check meta :

head(meta)
##      sampletype
## Ha1          Ha
## Ha2          Ha
## Ha3          Ha
## Ctr1        Ctr
## Ctr2        Ctr
## Ctr3        Ctr

Install pheatmap package and load the package

install.packages("pheatmap")

?pheatmap # help page for pheatmap

Plot heatmap using the correlation matrix and the metadata object

library(pheatmap)
pheatmap(rld_cor, annotation = meta)

If we want to change color, we can try the codes below:
R color in Brewer’s palettes can be available the link https://r-graph-gallery.com/38-rcolorbrewers-palettes

pheatmap function instruction are also available [pheatmap] (https://www.rdocumentation.org/packages/pheatmap/versions/1.0.12/topics/pheatmap)

heat.colors <- RColorBrewer::brewer.pal(6, "Blues")
pheatmap(rld_cor, annotation = meta, color = heat.colors, border_color=NA, fontsize = 10, 
        fontsize_row = 10, height=20)

If you want to save the heatmap in local path in your computer:

pdf("./results/PlotHeatmap_dds.pdf")
heat.colors <- RColorBrewer::brewer.pal(6, "Blues")
pheatmap(rld_cor, annotation = meta, color = heat.colors, border_color=NA, fontsize = 10, 
        fontsize_row = 10, height=20)

6 Differential expression analysis (DEA) using EdgeR

Differential expression analysis can be carried out using DESeq2 or using EdgeR. The DESeq2 methods has been describes in DGE workshop Salmon.

Here we will learn to use another method – EdgeR

Actual raw integer read counts (un-normalized) are then used for DGE analysis using edgeR. edgeR prefers the raw integer read counts.

To know the current directory

getwd()

Navigate to your work directory, for example: At RStudio –> Session –> Set Working Directory –> Choose Directory –> Desktop –> RNA-seq_DEA

or use code below:

setwd("C:/Users/username/Desktop/RNA-seq_DEA")

6.1 Read the individual file by a meta file called fileDesc.txt

load EdgeR package

library(edgeR)
options(digits=3)

To tells R where the data files are Check the data saving path in the fileDesc.txt, which could be same path as the individuals.

infection.targets<-read.delim("./data/fileDesc.txt")

To check Coinfection.targets

infection.targets
##             files inoculation         description
## 1  ./data/Ha1.txt          Ha H.annosum infection
## 2  ./data/Ha2.txt          Ha H.annosum infection
## 3  ./data/Ha3.txt          Ha H.annosum infection
## 4 ./data/Ctr1.txt         Ctr             Control
## 5 ./data/Ctr2.txt         Ctr             Control
## 6 ./data/Ctr3.txt         Ctr             Control

change the rawnames of the dataframe Coinfection.targets, later on we don’t need to rename column names.

rownames(infection.targets)<-c("Ha1","Ha2","Ha3","Ctr1","Ctr2","Ctr3")

To check Coinfection.targets again, The difference with the above one is an addtional rownames on the left side

infection.targets
##                files inoculation         description
## Ha1   ./data/Ha1.txt          Ha H.annosum infection
## Ha2   ./data/Ha2.txt          Ha H.annosum infection
## Ha3   ./data/Ha3.txt          Ha H.annosum infection
## Ctr1 ./data/Ctr1.txt         Ctr             Control
## Ctr2 ./data/Ctr2.txt         Ctr             Control
## Ctr3 ./data/Ctr3.txt         Ctr             Control

read and merges a set of text files containing gene expression counts, it makes a DGEList object directly

infection <- readDGE(infection.targets, header=F)

Check the dimension of DGElist R object

dim(infection)
## [1] 51751     6
head(infection)
## An object of class "DGEList"
## $samples
##                files inoculation         description group lib.size
## Ha1   ./data/Ha1.txt          Ha H.annosum infection     1 12510293
## Ha2   ./data/Ha2.txt          Ha H.annosum infection     1  8033676
## Ha3   ./data/Ha3.txt          Ha H.annosum infection     1 10094630
## Ctr1 ./data/Ctr1.txt         Ctr             Control     1 10763763
## Ctr2 ./data/Ctr2.txt         Ctr             Control     1 10539029
## Ctr3 ./data/Ctr3.txt         Ctr             Control     1 12251172
##      norm.factors
## Ha1             1
## Ha2             1
## Ha3             1
## Ctr1            1
## Ctr2            1
## Ctr3            1
## 
## $counts
##             Samples
## Tags         Ha1 Ha2 Ha3 Ctr1 Ctr2 Ctr3
##   PITA_19277 274 169 195   90   80  329
##   PITA_36893   0   0   0    0    0    0
##   PITA_43071 157 101 150  128  133  127
##   PITA_31484   0   0   0    0    0    0
##   PITA_22913   0   0   0    0    0    0
##   PITA_43120  62  42  22   53   24   15

6.2 Raw count distribution

Get the raw mapped count before filtering

infection.rawCount <- infection$count
head(infection.rawCount)
##             Samples
## Tags         Ha1 Ha2 Ha3 Ctr1 Ctr2 Ctr3
##   PITA_19277 274 169 195   90   80  329
##   PITA_36893   0   0   0    0    0    0
##   PITA_43071 157 101 150  128  133  127
##   PITA_31484   0   0   0    0    0    0
##   PITA_22913   0   0   0    0    0    0
##   PITA_43120  62  42  22   53   24   15
install.packages("ggplot2")
library(ggplot2)

To get an idea about how RNA-seq counts are distributed, let’s plot a histogram of the counts for a single sample, ‘Ha1’:

ggplot(infection.rawCount) +
  geom_histogram(aes(x = Ha1), stat = "bin", bins = 200) +
  xlab("Raw expression counts") +
  ylab("Number of genes")

Export the .png file

png("./results/count distribution.png", res=300, height=1800, width=1800)
ggplot(infection.rawCount) +
  geom_histogram(aes(x = Ha1), stat = "bin", bins = 200) +
  xlab("Raw expression counts") +
  ylab("Number of genes")
dev.off()

6.3 Count normalization with TMM in EdgeR

Export raw count table into results folder

write.csv(infection.rawCount, file="./results/infection.rawCounts.csv")

Get the counts per million (TMM normalised) before filtering

infection.normCPM <- cpm(calcNormFactors(infection))
dim(infection.normCPM)
## [1] 51751     6
head(infection.normCPM)
##             Samples
## Tags           Ha1   Ha2   Ha3  Ctr1  Ctr2  Ctr3
##   PITA_19277 22.35 20.46 19.04  8.12  7.85 27.31
##   PITA_36893  0.00  0.00  0.00  0.00  0.00  0.00
##   PITA_43071 12.81 12.23 14.65 11.55 13.06 10.54
##   PITA_31484  0.00  0.00  0.00  0.00  0.00  0.00
##   PITA_22913  0.00  0.00  0.00  0.00  0.00  0.00
##   PITA_43120  5.06  5.09  2.15  4.78  2.36  1.24
write.csv(infection.normCPM, file="./results/infection.normCPM.csv")

6.4 Filter counts per million (CPM) out and get smaller size of libraries

Keep genes that are expressed at least 1 CPM in at least 3 libraries, normally it is the number of biological replicates of smaller group

infection.filtered <- rowSums(cpm(infection)>1) >=3
table(infection.filtered)
## infection.filtered
## FALSE  TRUE 
## 25742 26009

Libraries size of data BEFORE filtering

infection$samples$lib.size
## [1] 12510293  8033676 10094630 10763763 10539029 12251172

cover the original file with our filter data

Infection <- infection[infection.filtered,]

libraries size of data after filtering

colSums(Infection$counts)
##      Ha1      Ha2      Ha3     Ctr1     Ctr2     Ctr3 
## 12455638  7998136 10047983 10717381 10497878 12206822
dim(Infection)
## [1] 26009     6

Update the filtered libraries size

Infection$samples$lib.size <- colSums(Infection$counts)
Infection$samples
##                files inoculation         description group lib.size
## Ha1   ./data/Ha1.txt          Ha H.annosum infection     1 12455638
## Ha2   ./data/Ha2.txt          Ha H.annosum infection     1  7998136
## Ha3   ./data/Ha3.txt          Ha H.annosum infection     1 10047983
## Ctr1 ./data/Ctr1.txt         Ctr             Control     1 10717381
## Ctr2 ./data/Ctr2.txt         Ctr             Control     1 10497878
## Ctr3 ./data/Ctr3.txt         Ctr             Control     1 12206822
##      norm.factors
## Ha1             1
## Ha2             1
## Ha3             1
## Ctr1            1
## Ctr2            1
## Ctr3            1

6.5 Count normalization after filter

Performed normalisation with TMM method

Infection = calcNormFactors(Infection)

The libraries after normalisation

Infection$samples
##                files inoculation         description group lib.size
## Ha1   ./data/Ha1.txt          Ha H.annosum infection     1 12455638
## Ha2   ./data/Ha2.txt          Ha H.annosum infection     1  7998136
## Ha3   ./data/Ha3.txt          Ha H.annosum infection     1 10047983
## Ctr1 ./data/Ctr1.txt         Ctr             Control     1 10717381
## Ctr2 ./data/Ctr2.txt         Ctr             Control     1 10497878
## Ctr3 ./data/Ctr3.txt         Ctr             Control     1 12206822
##      norm.factors
## Ha1         0.981
## Ha2         1.021
## Ha3         1.011
## Ctr1        1.031
## Ctr2        0.969
## Ctr3        0.988

Get the counts per million (TMM normalised) after filtering

Infection.filtered.normCPM <-cpm(calcNormFactors(Infection))

Export TMM normalized count table after filtering

write.csv(Infection.filtered.normCPM, file="./results/Infection.filtered.normCPM.csv")

6.6 Experimental design

Treatment factor

group<-factor(c('Ha','Ha','Ha',"Ctr","Ctr","Ctr"))

Describe the experimental design,one factor with intercept, here Ctr is the intercept

Infection.design <- model.matrix(~group)   
rownames(Infection.design)<-colnames(Infection$counts)
Infection.design
##      (Intercept) groupHa
## Ha1            1       1
## Ha2            1       1
## Ha3            1       1
## Ctr1           1       0
## Ctr2           1       0
## Ctr3           1       0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

6.7 Sample-level quality control by multidimensional scaling (MDS)

to see if some samples are outliner. An MDS plot shows the relative similarities of the six samples.

plotMDS(Infection, main="MDS plot of RNA-Seq", labels=colnames(Infection$counts))

Export the MDS plot to a PNG file

png("./results/plotMDS.Infection.png", res=300, height=1800, width=1800)
plotMDS(Infection, main="MDS plot of Infection RNA-Seq", labels=colnames(Infection$counts))
dev.off()

6.8 Estimating the common dispersion, trended dispersion, tagwwise dispersion

Dispersion means biological coeffient of variation (BCV) squared. E.g. if genes expression typically differs from replicate to replicate by 20% its BCV is 0.2, and its dispersion is 0.04. Estimating the common dispersion

Infection <- estimateGLMCommonDisp(Infection, Infection.design)

Estimating the trended dispersion

Infection <- estimateGLMTrendedDisp(Infection, Infection.design)

Estimating the tagwwise dispersion

Infection <- estimateGLMTagwiseDisp(Infection, Infection.design)

Creating a visual representation of the mean-variance relationship and BCV-mean logCPM relationship

plotMeanVar(Infection, show.tagwise.vars=T,NBline=T)
plotBCV(Infection)

The square root of dispersion is the coefficient of biological variation (BCV). The common BCV is on the high side, considering that this is a designed experiment using genetically identical plants. The trended dispersion shows a decreasing trend with expression level. At low logCPM, the dispersions are very large indeed.

6.9 Fit DGEList and design matrix to genewise GLM

Infection.fit <- glmFit(Infection, Infection.design)
colnames(Infection.fit)
## [1] "(Intercept)" "groupHa"

Likelihood test for genes respond differently between different conditions, table of top differentially expressed tags, n specify n tags to display

lrt.Ha_vs_Ctr <- glmLRT(Infection.fit, coef=2)  # to compare Ha vs Ctr (Ha_vs_Ctr)
t1<-topTags(lrt.Ha_vs_Ctr, n=nrow(Infection))
head(t1$table)
##            logFC logCPM  LR   PValue      FDR
## PITA_45726  9.43   5.16 187 1.72e-42 4.46e-38
## PITA_40477 11.38   3.97 163 2.42e-37 3.15e-33
## PITA_34642 12.50   5.10 161 7.02e-37 6.09e-33
## PITA_21495  4.75   4.59 115 6.18e-27 4.02e-23
## PITA_07064  3.85   8.28 112 3.71e-26 1.93e-22
## PITA_20445  3.95   9.96 105 1.29e-24 5.59e-21

6.10 Extract number of differentially expressed (DE) genes

summary(decideTests(lrt.Ha_vs_Ctr, adjust.method="BH", p.value=0.05))
##        groupHa
## Down       123
## NotSig   25424
## Up         462

The UP-regulated genes (can change logFC to be more strict)

nrow(subset(topTags(lrt.Ha_vs_Ctr, n=586)$table,  logFC > 0))
## [1] 463
lrt.Ha_vs_Ctr_UP <- subset(topTags(lrt.Ha_vs_Ctr, n=586)$table, logFC > 0)

The DW-regulated genes (can change logFC to be more strict)

nrow(subset(topTags(lrt.Ha_vs_Ctr, n=586)$table,  logFC < 0))
## [1] 123
lrt.Ha_vs_Ctr_DW <- subset(topTags(lrt.Ha_vs_Ctr, n=586)$table, logFC < 0)

6.11 Differentially expressed transcripts’ tag

DEtags.lrt.Ha_vs_Ctr <- rownames(Infection)[as.logical(decideTests(lrt.Ha_vs_Ctr, adjust.method="BH", p.value=0.05))]

Export list of UP-regulated and DW-regulated transcripts

write.csv(lrt.Ha_vs_Ctr_UP, file="./results/lrt.Ha_vs_Ctr_UP.csv")
write.csv(lrt.Ha_vs_Ctr_DW, file="./results/lrt.Ha_vs_Ctr_DW.csv")

To ask all the genes label as grey color

Infection.colHavsCtr = rep('grey55', nrow(Infection))

To assign colour for DE transcripts

Infection.colHavsCtr[lrt.Ha_vs_Ctr$table$PValue < 0.05 & lrt.Ha_vs_Ctr$table$logFC >0 ] <- "red"
Infection.colHavsCtr[lrt.Ha_vs_Ctr$table$PValue < 0.05 & lrt.Ha_vs_Ctr$table$logFC <0 ] <- "blue"

Plot Smear plot with adjustment on Y-axis label

par(omi=c(0.1,0.1,0.1,0.1), las=1, cex=0.5, mgp=c(3,1,0), cex.main=1.8, cex.lab=1.4, cex.axis=1.4)
plotSmear(lrt.Ha_vs_Ctr, de.tags=DEtags.lrt.Ha_vs_Ctr, xlab="log-counts per million (logCPM)", ylab="log2-fold change (log2FC)", main="Ha infection compared to Control", pch=19, cex=0.4, smearWidth=0.5, panel.first=grid(), smooth.scatter=FALSE, ylim=c(-7,7), yaxs="i")
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "panel.first" is not a
## graphical parameter
abline(h=c(-1,1),col="dodgerblue")

Plot Smear plot with adjustment on Y-axis label and colour of DE tags changed

par(omi=c(0.1,0.1,0.1,0.1), las=1, cex=0.5, mgp=c(3,1,0), cex.main=1.8, cex.lab=1.4, cex.axis=1.4)
plotSmear(lrt.Ha_vs_Ctr, xlab="log-counts per million (logCPM)", ylab="log2-fold change (log2FC)", main="a infection compared to Control", smearWidth=0.5, pch=21, cex=0.4, deCol="red", col=Infection.colHavsCtr, ylim=c(-7,7), yaxs="i")
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "panel.first" is not a
## graphical parameter
abline(h=c(-1,1),col="dodgerblue")

Plot Smear plot with adjustment on Y-axis label and colour of DE tags changed and export as png

png("./results/plotSmear.InfectionRNAseq.png", res=300, height=1800, width=1800)
par(omi=c(0.1,0.1,0.1,0.1), las=1, cex=0.5, mgp=c(3,1,0), cex.main=1.8, cex.lab=1.4, cex.axis=1.4)
plotSmear(lrt.Ha_vs_Ctr, xlab="log-counts per million (logCPM)", ylab="log2-fold change (log2FC)", main="Ha infection compared to Control", smearWidth=0.5, pch=21, cex=0.4, deCol="red", col=Infection.colHavsCtr, ylim=c(-7,7), yaxs="i")
abline(h=c(-1,1),col="dodgerblue")
dev.off()